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Abstract Principal component analysis (PCA) 
is a well-established method commonly used 
to explore and visualise data. A classical PCA 
model is the fixed effect model where data are 
generated as a fixed structure of low rank cor- 
rupted by noise. Under this model, PCA does 
not provide the best recovery of the under- 
lying signal in terms of mean squared error. 
Following the same principle as in ridge re- 
gression, we propose a regularised version of 
PCA that boils down to threshold the singu- 
lar values. Each singular value is multiplied by 
a term which can be seen as the ratio of the 
signal variance over the total variance of the 
associated dimension. The regularised term is 
analytically derived using asymptotic results 
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and can also be justified from a Bayesian treat- 
ment of the model. Regularised PCA provides 
promising results in terms of the recovery of 
the true signal and the graphical outputs in 
comparison with classical PCA and with a soft 
thresholding estimation strategy. The gap be- 
tween PCA and regularised PCA is all the 
more important that data are noisy. 

Keywords principal component analysis • 
singular value thresholding ■ regularised PCA • 
fixed effect model • visualisation • denoising. 

1 Introduction 

Principal component analysis (PCA) is a well- 
established method which can be used to ex- 
plore and visualise data. PCA can be presented 
from a geometrical or a model point of view. 
A classical PCA model is the fixed effect one 



(Caussinus 1986 ) where a data matrix of n in- 



dividuals and p variables (considered to be cen- 
tered) is generated as a fixed structure having 
a low rank representation corrupted by noise: 

Xnxp = X raX p + Snxp (1) 
S 

Xi J = ^2 ^ ~&»1i» T 3» + £ i] ' £ iJ ~ N(0, (J 2 ) 

3 = 1 

where d s is the s th eigenvalue of the matrix 
X'X (n times the true covariance matrix), r s = 



2 
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.,r ps } is the associated eigenvec- 



tor and q s = {qu, — , Qis, —1.9ns} is the s th 
eigenvector of the matrix XX' (n times the 
true inner-product matrix) . Under this model, 
the individuals have different expectations and 
randomness is only due to the error term. This 
is in agreement with situations where PCA 
is performed on data where the individuals 
are objects of interest and are not a random 
sample drawn from a population of individu- 
als. Such situations frequently arise in prac- 
tice. For instance, in sensory analysis, individ- 
uals can be products, such as chocolates, and 
variables can be sensory descriptors, such as 
bitterness, sweetness, etc. The aim is to study 
these specific products and not others (they 
are not interchangeable). It thus makes sense 
to estimate the individual parameters (q s ) and 
to study the graphical representation of the in- 
dividuals as well as the representation of the 
variables. In addition, let us point out that 
the inferential framework associated with this 
model is not usual. Indeed the number of pa- 
rameters increases when the number of indi- 
viduals increases. Consequently, in this model, 
the asymptotic results are obtained consider- 
ing the noise variance tends to 0. 

The maximum likelihood estimators of this 
model correspond to the least squares ones, i.e. 
the usual PCA solution. More precisely, PCA 
finds a matrix X„ xp , of low rank S, which min- 
imises || X — X|| 2 with ||*|| the Frobenius norm. 
The solution is given by the singular value de- 
composition (SVD) of X: 



= \Z%u is v 



"3 a 



(2) 



s = l 



where A s is the s th eigenvalue of X'X, u s = 
{tils, •••) Ui s , u ns } the s th left singular vec- 
tor and v s = {vis, ...,Vj s , ...,v ps } the s th right 
singular vector. 

It is established, in regression for instance, 
that the maximum likelihood estimators are 
not necessarily the best ones in terms of mean 
squared error (MSE). However, shrinkage es- 
timators, although biased, have smaller vari- 
ance which may reduce the MSE. We follow 



this approach and propose a regularised ver- 
sion of PCA in order to get a better estimate 
of the underlying structure X and graphical 
representations which are as close as possible 
to the representations that would be obtained 
from the signal only. As we will show later, the 
approach boils down to threshold the singular 
values with a different amount of shrinkage for 
each singular value. The shrinkage term will 
be analytically derived. 

Singular value thresholding is a popular es- 
timation strategy to recover a low rank sig- 
nal from noisy data. The most classical ap- 
proach consists in using soft thresholding. It 
boils down to threshold each singular value 
with a same amount of shrinkage usually deter- 
mined by cross-validation. However, recently, 



Candes et al ( 2012[ ) proposed to determine the 



threshold level without resorting to a compu- 
tational method by minimising an estimate of 
the MSE namely a Stein's Unbiased Risk Esti- 
mate (SURE). We will confront our approach 
to the SURE method. 

In addition, it is worth quoting the work 



of Takane and Hwang] ( |2006[ ) and |Hwang et al 
( 2009 1 who proposed, in an exploratory frame- 



work, a regularised version of multiple corre- 
spondence analysis (MCA). Their aim is to 
get parameters and graphical representations 
which are as close as possible to the "true" 
ones in terms of MSE. Their approach is thus 
close to our proposal but there are three main 
differences. Firstly, MCA is classically applied 
to survey data, where a sample of individuals 
is studied. Secondly, they do not consider any 
underlying model. Thirdly, the shrinkage term 
is determined via cross-validation. 

In this paper, we derive the shrinkage terms 
by minimising the mean squared error and de- 
fine regularised PCA (rPCA) in Section [2] We 
also show that rPCA can be presented from a 
Bayesian treatment of the fixed effect model 
([lj . Section [3] shows the efficiency of regular- 
isation through a simulation study in which 
rPCA is set against classical PCA and the 
SURE method. The performance of rPCA is 
illustrated through the recovery of the signal 
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and the graphical outputs (individual and vari- 
able representations). Finally, rPCA is 
performed on a real microarray data set in Sec- 
tion m 



2 Regularised PCA 

2.1 MSE point of view 

2.1.1 Minimising the MSE 

PCA provides an estimator X which is as close 
as possible to X in the least squares sense. 
However, assuming model ([T]), the objective 
is to get an estimator as close as possible to 
the unknown signal X. To achieve such a goal, 
the same principle as in ridge regression is fol- 
lowed. We look for a shrinkage version of the 
maximum likelihood estimator which is as close 
as possible to the true structure. More pre- 
cisely, we look for shrinkage terms 
(0a)s=i,...,min(n-i,p) that minimise: 



/ 



MSE = E 



min(n— l,p) 



£ £ 



s = l 



with Xjj = \f\lui,Vj 



i t j \s — l 



l^ij *ij 



min(n— 



*(«) _ s(») 



s=S+l 




According to equation |T]), for all s > S + 1, 

^ = °> therefore = ... = m in(n-i, P ) = 

0. Thus, the MSE can be written as: 




MSE = E 



Using the orthogonality constraints, for all s j£ 
s', J2i u isUis> = J2j VjsVjs' = 0, the MSE can 



be simplified as follows: 



2 2 



mse = e(^ ]T^ Asu , 

i,j \a=l 



(3) 



Equation ([3| is differentiated with respect to 
<t>s to get: 



Based on the results of Denis and Pa zmanl 



(19991 and Denis and Gower] ( |1996 l, we can 



state that the PCA estimator is asymptotically 
unbiased E(&^ ) = xry . More precisely, they 
studied nonlinear regression models with con- 
straints and focused on bilinear models in an 
analysis of variance framework with two fac- 
tors, called biadditive models. Such models are 
defined as follow: 

s 

Vij = M + a i + Pj + E Tia^'a + £ ij 
s=l 

with £ij ~ Af(0,a 2 ) 



(4) 



where [i is the grand mean, ( ai)i—\ t , and 
{Pj)j=x ... j correspond to the main effect pa- 
rameters and ('YisSjs)s=i,...,s models the in- 
teraction. The least square estimates of the 
multiplicative terms are given by the singular 
value decomposition of the residual matrix of 
the model without interaction. Using the Ja- 
cobians and the Hessians of the response de- 



fined by 


Denis and Gower 


( 1994 1 and recently 


in |Papadopoulo and Lourakis 


( 


2000 


), Denis 


and Gower ( 1996 ) derived the asymptotic bias 



of the response and showed that the response 
estimator is approximately unbiased. In the 
PCA framework, it leads to state that asymp- 



totically E (£i 



E 



(*&') 



-(a) 



Xij and for each dimension 
In addition, we can consider 
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that the variance of Xij is equal to the noise 
variance. Therefore, we estimate V f^J^ by 
average 



the 



V i 



(«) 



variance, 



that 



y J min(n — l;p) ' 

Consequently <fi s can be approximated by: 



( min(n— l;p) ^ ^ 



_(a)2 
X y 



As for all s^s', the dimensions s and s' of X 
are orthogonal, <p s can be written as: 



« j y 



Ej,j (min(ri-l;p) ' 2 +^ 



s (s)2 



Based on equation Q, X^^y ) 2 i s equal to 
d s the variance of the s th dimension of the sig- 
nal. 4> s is then equal to: 



d. 



np 

— • — — tt< 




-V,s = l,...,S 



(5) 



else wise 



The form of the shrinkage term is appealing 
since it corresponds to the ratio of the vari- 
ance of the signal over the total variance (sig- 
nal plus noise) for the s th dimension. 

Remark: This term is obtained using asymp- 
totic results (a 2 — > 0), however 



Denis and Pdzmam (1999) verified that even 



in situations jar from the asymptotic, these ap- 
proximations are reasonable. We also checked 
this assumption through simulations. 

Remark: Models such as model Q are also 
known as Additive Main effects and Multiplica- 
tive Interaction models (AMMI). They are of- 
ten used to analyse genotype- environment data 
in plant breading framework. Considering a ran- 
dom version of such models, 



Cornelius and Crossa (1999) came up with a 



regularisation term which is close to ours. It 
allows them to improve the prediction of the 
yield of genotypes in relation with environments. 



2.1.2 Definition of the regularised PC A 

The shrinkage factor ^ depends on unknown 
quantities. We estimate them by plug-in. The 
total variance of the s th dimension is estimated 
by the eigenvalue X s . The signal variance of the 
s th dimension is estimated by the estimated to- 
tal variance of the s th dimension minus an es- 
timate of the noise variance on the s th dimen- 
sion. Consequently, <f> s is estimated by <f> s = 

A '~ mto ^- lirt — . Regularised PCA (rPCA) is 
thus defined as: 



S / \ np 

( .ll*CA ^2 ( '"iiu,,-!:,./ 



y 



s=l 

s 



A, 



min(n — 

7X, 



rPCA boils down to threshold the first S sin- 
gular values. It can be interpreted as a com- 
promise between hard and soft thresholding. 
Hard thresholding consists in selecting a cer- 
tain number of dimensions S which corresponds 
to classical PCA (equation [2]) whereas soft 
thresholding consists in thresholding all singu- 
lar values with the same amount of shrinkage. 
In rPCA, the s th singular value is less shrunk 
than the (s + l) th one. This can be interpreted 
as granting a greater weight to the first di- 
mensions. This behaviour seems satisfactory. 
Indeed, the first dimensions can be considered 
as more stable and trustworthy than the last 
ones. The regularisation procedure is more re- 
lying on what is less variable. When a 2 is small, 
<j) s is close to 1 and rPCA comes down to per- 
form PCA. When a 2 is high, (f> s is close to 
and the values of x r ^^^ are close to which 
corresponds to the average of the variables (be- 
cause X 1 ^^^ is centered). From a geometrical 
point of view rPCA leads to bring the individ- 
uals closer to the centre of gravity. 

The regularisation procedure requires to es- 
timate the residual variance a 2 . As the maxi- 
mum likelihood estimator is biased, another es- 
timator corresponds to the ratio of the residual 
sum of squares divided by the number of data 
minus the number of independent parameters. 
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These latter are equal to 
P+((nS -S)- ^) + (pS - ^ - s), 
i.e. p parameters for the centering, 
(^(nS — S) — S( - s 2 +1 ' 1 ^ for the centered and or- 
thonormal left singular vectors and 
(pS — * s ( s+1 ) — S ) for the orthonormal right 



1987) with an isotropic noise: 



singular vectors. This number of parameters 
can also be calculated as the trace of the pro- 



jection matrix involved in PCA I 


Candes and 


Tao 


2009 


Josse and Husson 


2011 


). Therefore, 



the residual variance is estimated as: 

S 2 = l|X-X|| 2 

np — p — nS — pS + S 2 + S 



min(n— l;p) 



T—^mmyn- 



A, 



np — p — nS — pS + S 2 + S ^ 

This classical estimator, the residual sum of 
squares divided by the number of data minus 
the number of independent parameters, is still 
biased contrary to many methods. It is unbi- 
ased only asymptotically. This statement has 
been confirmed by simulations. 



2.2 Bayesian points of view 

Regularised PCA has been presented and de- 
fined via the minimisation of the MSE in see- 
However, it is possible to define the 



tion 2.1 



method without any reference to MSE but us- 
ing Bayesian considerations. It is well known 
in regression that there is a link between ridge 
regression and a Bayesian treatment of the re- 
gression model. More precisely, the maximum 
a posteriori of the regression parameters as- 
suming a Gaussian prior for these parameters 



correspond to the ridge estimators ( Hastie et al 



2009 p. 64). In this section we make the link 
between regularised PCA and Bayesian treat- 
ments of PCA. 

2.2.1 Probabilistic PCA model 



The probabilistic PCA (pPCA) model (Roweis 



1998 


Tipping and Bishop 1999 


is a particular 


case of factor analysis model ( 


Bartholomew 



'A/^Is^-A^O^Ip) 



with B 



pxS 



the matrix of unknown coefficients 



and Is and I p the identity matrices of size S 
and p. This model induces a Gaussian distribu- 
tion on the individuals with a specific structure 
of variance-covariance: 

x, - Af(0, S) with £ = BB' + cr'% 

There is an explicit solution for the maximum 
likelihood estimators: 



B = V(A-cr 



(7) 



with V defined as in equation |2]) i.e. as the 
matrix of the first S left singular vectors of X, 
A a diagonal matrix constituted of the first S 
eigenvalues, R-sxs a rotation matrix (usually 
equal to Is) and a 2 estimated as the mean of 
the last eigenvalues. 

This model can be seen as a random ef- 
fect model by opposition to the fixed effect 
model (JTJ), since the structure is random. It 
is more in agreement with cases where PCA 
is performed on sample data such as survey 
data. Indeed, the individuals are independent 
and identically distributed and only consid- 
ered for the information they provide on the 
links between variables. In such studies, it does 
not make sense to consider "estimates" of the 
"individual parameters" since no parameter is 
associated with the individuals, only random 
variables (z^) are. However, estimators of the 
"individual parameters" are usually calculated 
as the expectation of the latent variables given 
the observed variables E(z i |x i ). The calcula- 



tion is detailed in Tipping and Bishop ( 1999 ) 
and leads to: 

Z = XB(B'B + o-Hs)- 1 (8) 
Such estimators are often called BLUP estima- 



tors \ Robinson 1991). Thus, using the maxi- 
mum likelihood estimator of B (equation [7| 
and equation (Isb , it is possible to build a fit- 
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ted matrix as: 

X pPCA = = XB(B'B + aHs)- 1 ^' 

= XV(A - ct 2 I s )5 A _1 (A - cr 2 I s )5V' 

= U(A-(7 2 I S )A-5V' 

since X V = A 1/2 U (given by the SVD of X) 

Therefore, considering the pPCA model leads 
to a fitted matrix of the same form than x r ^ > ^^ 
presented in Section 2.1.2| with the same shrunk 
singular values (A — u 2 Is) A -1 / 2 . However, the 
pPCA model considers individuals as random, 
whereas they are fixed in model ([!]). Never- 
theless, assuming a distribution on Zj, which 
can be considered as the "individual parame- 
ters" , is a way to define constraints on the indi- 
viduals. Therefore, from a conceptual point of 
view, we prefer to consider this random effect 
model as a Bayesian treatment of the fixed ef- 
fect model with a prior distribution on the left 
singular vectors. 



2.2.2 An empirical Bayesian approach 

It is also possible to directly consider an em- 
pirical Bayesian treatment of the fixed effect 
model with a prior distribution on each cell of 
the data matrix per dimension: xvy ~ Af(0, r 2 ). 



From model 
(S) ~'A/"(0,T 2 - 



0- 



X 



min(n— l;p) 



this implies 
2 )- 



that 



tr 



Remark: Even if a maximum likelihood solu- 
tion is available ( equ ation^, it is possi ble to 
use an EM algorithm \Rubin and Thayer 1982) 
to estimate the parameters. The two steps are 
two multiple ridge regressions: 

Step E: Z = XB(B'B + aHs)- 1 
Step M: B = X'Z(Z'Z + <t 2 A- 1 )- 1 

The regularisation strategy can also be seen as 
the introduction of two ridge terms in the two 
following linear regressions leading to the usual 
PCA solution: 

Step E: U = XV(V'V)- 1 
Step M: V = X'U(U'U)- 1 
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The posterior distribution is: 



p(x[f\x 



.(») 



P\ x ij\ x i 3 



P 



■ exp 



V S3 »J J 



■ exp 



2r, 2 



■■ exp 



2K 2 + mln( „ 1 - 1;p ) ^ 2 ) 



: exp 



aW 



» 



27T^f 



The expectation of the posterior distribution 
is obtained combining the likelihood and the 
priors: 

e(x$\x$)=$ s x$ 

with ^, = — 



s min(n — l;p) 

This expectation depends on unknown quan- 
tities. They are estimated by maximising the 

likelihood of (x[f J as a func- 

V / i=l,...,ny'=l,...,p 
tion of r? to obtain: 



1 

rip 



-A., 



1 



min(n — l;p) 



-a 2 



Consequently the shrinkage factor is estimated 

_„ S - (4 A ^min(n L -l; P )^ 2 ) _ ^ s ~ min(n-l;p) 

Tip a 

and also corresponds to the regularisation term 
defined in Section [2. 1.1 1 



Remark: Hoff (2007) also proposed a Bayesian 



treatment of SVD-related models. His aim was, 
among others, to estimate the number of un- 
derlying dimensions. Roughly, his proposition 
consists in putting prior distributions on U, A, 
and V. More precisely, he uses von Mises uni- 
form (Hof]\ \200§il prior for orthonormal ma- 



trices (on the Steifeld manifold /(Chikuse , 2003\ )) 
for U and V and normal priors for the sin- 
gular values. It leads to a prior distribution 
for the structure X. Then he builds a Gibbs 



n(n-l;p)' 



sampler to get drawn from the posterior dis- 
tributions. The posterior expectation of X can 
be used as a punctual estimate. It can also be 
seen as a regularised version of the maximum 
likelihood estimate. However, contrary to the 
previously exposed approach, there is no close 
form expression for the regularisation. 



2.3 Trade-off bias- variance 

The rationale behind rPCA can be illustrated 
on graphical representations. Usually, different 
types of graphical representations are associ- 
ated with PCA depending on whether the left 
and right singular vectors are represented as 
normed to 1 or to their associated singular 
value. In our practice, we represent the individ- 
ual coordinates by UA' and the variable coor- 
dinates by VAi. Therefore, the global shape 
of the individual cloud represents the variance. 
Similarly, in the variable representation, the 
cosine of the angle between two variables can 
be interpreted as the covariance. Since rPCA 
comes down to modify the singular values, it 
will affect both the representation of the indi- 
viduals and of the variables. We focus here on 
the individual representation. 

Data are generated according to model ([T]) 
with an underlying signal X 5xl5 composed of 
5 individuals and 15 variables in two dimen- 
sions. Then, 300 matrices are generated with 
the same underlying structure: X s "™ = X5 X i5 + 
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e s i m with sim = 1, ...,300. On each data ma- 
trix, PCA and rPCA are performed. In fig- 
ure [lj the configurations of the 5 individuals 
obtained after each PCA are represented on 
the left, whereas the configurations obtained 
after each rPCA, are represented on the right. 
The average configurations over the 300 sim- 
ulations are represented by triangles and the 
true individual configuration obtained from the 
PCA of X is represented by big dots. Repre- 
senting several sets of coordinates from differ- 
ent PCAs can suffer from translation, reflec- 
tion, dilatation or rotation ambiguities. That 
is why all configurations are superimposed us- 



ing Procustes rotations (Gower and Dijkster- 



huis 2004 ) by taking as the reference the true 



individual configuration. 

Compared to PCA, rPCA provides a rep- 
resentation which is more biased because the 
coordinates of the average points (triangles) 
are systematically inferior to the coordinates 
of the true points (big dots). This is expected 
because the regularisation term shrinks the in- 
dividual coordinates towards the origin. In ad- 
dition, as it is clear for individual number 4 
(dark blue), the representation is less variable. 
Figure [T] gives thus a rough idea of the trade-off 
bias- variance. Note that even the PCA repre- 
sentation is biased, but this is also expected 
since E(X) = X only asymptotically as de- 
tailed section [2.1.11 



3 Simulation study 

To assess rPCA, a simulation study is con- 
ducted and rPCA is confronted to classical PCA 
as well as the SURE method proposed by 



Candes et al (2012 1. As explained in the in- 



troduction, the SURE method relies on a soft 
thresholding strategy: 



.SURE 



i(n,p) 



(VaT - a j u is vj S , 



The shrinkage parameter A is automatically 
selected by minimising Stein's Unbiased Risk 
Estimate (SURE). As a tuning parameter, the 



SURE method does not require the number 
of underlying dimensions of the signal, but it 
does require to estimate the noise variance a 2 
to determine A. 



3.1 Recovery of the signal 

Data are simulated according to model ([!]). 
The structure is simulated by varying several 
parameters: 

— the number of individuals n and the num- 
ber of variables p based on 3 different pairs: 
(n = 100 and p = 20; n = 50 and p = 50; 
n = 20 and p = 100) 

— the number of underlying dimensions S (2; 
4) 

— the ratio of the first eigenvalue on the sec- 
ond eigenvalue (di/d 2 ) of X (4; 1). When 
the number of underlying dimensions is 
higher than 2, the subsequent eigenvalues 
are roughly of the same order of magnitude. 

More precisely, X is simulated as follows: 

1. A SVD is performed on a n x S matrix gen- 
erated from a standard multivariate normal 
distribution. The left singular vectors pro- 
vide S empirically orthonormal vectors. 

2. Each vector s : 1, S is replicated to ob- 
tain the p variables. The number of times 
that each vector s is replicated depends on 
the ratio between the eigenvalues ((ii/G^)- 
For instance, if p = 50, S = 2, (di/tfe) = 4, 
the first vector is replicated 40 times and 
the second vector is replicated 10 times. 

Then, to generate the matrix X, a Gaussian 
isotropic noise is added to the structure. Dif- 
ferent levels of variance a 1 are considered to 



obtain three signal-to-noise ratios (Mazumder 



et al 2010 ) equal to 4, 1 and 0.8. A high signal- 



to-noise ratio (SNR) implies that the variables 
of X are very correlated, whereas a low SNR 
implies that the data are very noisy. For each 
combination of the parameters, 500 data sets 
are generated. 



Regularised PCA to denoise and visualise data 



9 



PCA 



rPCA 



-6 



-1 1- 

-2 

Axis 1 



-6 



-I 1- 

-2 

Axis 1 



Fig. 1: Superimposition of several configurations of individual coordinates using Procustes rota- 
tions towards the true individual configuration of X 5xl5 (big dots). Configurations of the PCA 
(left) and the rPCA (right) of each X = X + e S i m , with sim — 1, ...,300 are represented with 
small dots. The average configuration over the 300 configurations is represented by triangles. 



To assess the recovery of the signal, the 
MSE is calculated between X obtained from 
each method and X. The fitted matrices ob- 
tained by PCA and rPCA are obtained con- 
sidering the true number of underlying dimen- 
sions as known. The SURE method is per- 
formed with the true noise variance as in 



Candes et al (2012). Results of the simulation 



study are gathered in Table [T] 

First, rPCA outperforms both PCA and 
the SURE method in almost all the situations. 
As expected, the MSE obtained by PCA and 
rPCA are roughly of the same order of magni- 
tude when the SNR is high (SNR = 4) whereas 
rPCA is better than PCA when data are noisy 
(SNR = 0.8). The differences between rPCA 
and PCA are also more important when the 
ratio (di/tfe) is high than when the eigenval- 
ues are equal. When (di/tfc) is large, the signal 
is concentrated on the first dimension whereas 
it is scattered in more dimensions when the 
ratio is smaller. Consequently, a same amount 
of noise has a greater impact on the second 
dimension in the first case. This may increase 
the advantage of rPCA which tends to reduce 
the impact of noise. 



The main characteristic of the SURE 
method observed in all the simulations is that 
it gives particularly good results when the data 
are very noisy. Consequently, the results are 
satisfactory when SNR = 0.8 and this is all 
the more true than the number of underly- 
ing dimensions is high. This behaviour can be 
explained by the fact that a same amount of 
signal is more impacted by the by the noise 
if the signal is scattered on many dimensions. 
This remark highlights the fact that the SNR 
is not necessarily a good measure of the level 
of noise in a data set. In addition, the results 
of the SURE method are quite poor when the 
SNR is high. This can be explained by the fact 
that the SURE method takes into account too 
many dimensions (since all the singular val- 
ues greater than the threshold are kept) in 
the estimation of X ou ^ . For example, with 
n = 100, p = 20, S = 2, SNR = 4 and 
(di/d 2 ) = 4 (first row of Table [l), the SURE 
method considers between 9 and 13 dimensions 
to estimate X SURE . 

Finally, the behaviour regarding the ratio 
(n/p) is worth commenting. The MSEs are in 
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Table 1: Mean Squared Error (and its standard deviation) between X and X for PCA, rPCA and 
SURE method over 500 simulations. Results are given for different numbers of individuals (n), 
numbers of variables (p), numbers of underlying dimensions (S), signal-to- noise ratios (SNR) 
and ratios of the first eigenvalue on the second eigenvalue (di/c^)- 



n 


P 


s 


SNR 


(di/d 2 ) 




PCA ,X) 


A/SB(X r 


PCA ,X) 


J/S£(X 


SURE -j^j 


100 


20 


2 


4 


4 


4.22E-04 


(1.69E-06) 


4.22E-04 


(1.69E-06) 


8.17E-04 


(2.67E-06) 


100 


20 


2 


4 


1 


4.21E-04 


(1.75E-06) 


4.21E-04 


(1.75E-06) 


8.26E-04 


(2.89E-06) 


100 


20 


2 


1 


4 


1.26E-01 


(5.29E-04) 


1.08E-01 


(4.56E-04) 


1.60E-01 


(6.15E-04) 


100 


20 


2 


1 


1 


1.23E-01 


(5.05E-04) 


1.11E-01 


(4.61E-04) 


1.69E-01 


(6.28E-04) 


100 


20 


2 


0.8 


4 


3.34E-01 


(1.38E-03) 


2.40E-01 


(9.90E-04) 


3.10E-01 


(1.05E-03) 


100 


20 


2 


0.8 


1 


3.12E-01 


(1.38E-03) 


2.45E-01 


(1.10E-03) 


3.32E-01 


(1.22E-03) 


100 


20 


4 


4 


4 


8.25E-04 


(2.39E-06) 


8.24E-04 


(2.38E-06) 


1.42E-03 


(3.54E-06) 


100 


20 


4 


4 


1 


8.26E-04 


(2.38E-06) 


8.25E-04 


(2.38E-06) 


1.43E-03 


(3.48E-06) 


100 


20 


4 


1 


4 


2.60E-01 


(8.44E-04) 


1.96E-01 


(6.51E-04) 


2.43E-01 


(6.84E-04) 


100 


20 


4 


1 


1 


2.47E-01 


(7.16E-04) 


2.04E-01 


(5.99E-04) 


2.62E-01 


(6.94E-04) 


100 


20 


4 


0.8 


4 


7.41E-01 


(2.69E-03) 


4.27E-01 


(1.53E-03) 


4.36E-01 


(1.11E-03) 


100 


20 


4 


0.8 


1 


6.68E-01 


(2.02E-03) 


4.40E-01 


(1.40E-03) 


4.83E-01 


(1.33E-03) 


50 


50 


2 


4 


4 


2.81E-04 


(1.32E-06) 


2.81E-04 


(1.32E-06) 


5.95E-04 


(2.24E-06) 


50 


50 


2 


4 


1 


2.79E-04 


(1.24E-06) 


2.79E-04 


(1.24E-06) 


5.93E-04 


(2.21E-06) 


50 


50 


2 


1 


4 


8.48E-02 


(4.09E-04) 


7.82E-02 


(3.85E-04) 


1.26E-01 


(4.97E-04) 


50 


50 


2 


1 


1 


8.21E-02 


(3.87E-04) 


7.77E-02 


(3.70E-04) 


1.31E-01 


(5.08E-04) 


50 


50 


2 


0.8 


4 


2.30E-01 


(1.12E-03) 


1.93E-01 


(9.64E-04) 


2.55E-01 


(1.01E-03) 


50 


50 


2 


0.8 


1 


2.14E-01 


(9.58E-04) 


1.89E-01 


(8.57E-04) 


2.73E-01 


(1.07E-03) 


50 


50 


4 


4 


4 


5.48E-04 


(1.84E-06) 


5.48E-04 


(1.84E-06) 


1.04E-03 


(2.82E-06) 


50 


50 


4 


4 


1 


5.46E-04 


(1.76E-06) 


5.46E-04 


(1.76E-06) 


1.04E-03 


(2.79E-06) 


50 


50 


4 


1 


4 


1.75E-01 


(6.21E-04) 


1.53E-01 


(5.54E-04) 


2.00E-01 


(5.79E-04) 


50 


50 


4 


1 


1 


1.68E-01 


(5.49E-04) 


1.52E-01 


(5.08E-04) 


2.09E-01 


(6.04E-04) 


50 


50 


4 


0.8 


4 


5.07E-01 


(1.90E-03) 


3.87E-01 ( 


1.53E-03) 


3.85E-01 


(1.12E-03) 


50 


50 


4 


0.8 


1 


4.67E-01 


(1.62E-03) 


3.76E-01 


(1.38E-03) 


4.13E-01 


(1.23E-03) 


20 


100 


2 


4 


4 


4.22E-04 


(1.72E-06) 


4.22E-04 


(1.72E-06) 


8.15E-04 


(2.80E-06) 


20 


100 


2 


4 


1 


4.21E-04 


(1.69E-06) 


4.20E-04 


(1.70E-06) 


8.20E-04 


(2.89E-06) 


20 


100 


2 


1 


4 


1.25E-01 


(5.35E-04) 


1.06E-01 


(4.53E-04) 


1.57E-01 


(5.83E-04) 


20 


100 


2 


1 


1 


1.22E-01 


(5.28E-04) 


1.10E-01 


(4.76E-04) 


1.67E-01 


(6.20E-04) 


20 


100 


2 


O.S 


4 


3.30E-01 


(1.43E-03) 


2.35E-01 


(1.03E-03) 


3.06E-01 


(1.13E-03) 


20 


100 


2 


0.8 


1 


3.18E-01 


(1.30E-03) 


2.50E-01 


(1.03E-03) 


3.34E-01 


(1.25E-03) 


20 


100 


4 


4 


4 


8.28E-04 


(2.38E-06) 


8.27E-04 


(2.39E-06) 


1.41E-03 


(3.64E-06) 


20 


100 


4 


4 


1 


8.29E-04 


(2.58E-06) 


8.28E-04 


(2.58E-06) 


1.42E-03 


(3.68E-06) 


20 


100 


4 


1 


4 


2.55E-01 


(7.59E-04) 


1.97E-01 


(5.92E-04) 


2.45E-01 


(6.47E-04) 


20 


100 


4 


1 


1 


2.48E-01 


(7.45E-04) 


2.04E-01 


(6.20E-04) 


2.60E-01 


(6.91E-04) 


20 


100 


4 


0.8 


4 


7.13E-01 


(2.55E-03) 


4.15E-01 


(1.47E-03) 


4.37E-01 


(1.19E-03) 


20 


100 


4 


0.8 


1 


6.66E-01 


(2.01E-03) 


4.34E-01 


(1.31E-03) 


4.78E-01 


(1.24E-03) 



the same order of magnitude for (n/p) = 0.2 
and (n/p) = 5 and much smaller for (n/p) = 1 
for all the methods. The issue of dimensional- 
ity does not lie in the fact that the number of 
variables is much larger than the number indi- 
viduals. However, difficulties are encountered 
when either one mode (n or p) is larger than 
the other one which can be explained by the 
bilinear form of the model. 



3.2 Simulations from Candes et al (2012) 



Regularised PCA is also assessed on the simu- 



lations of Candes et al (2012). Simulated ma- 



trices of size 200 x 500 were drawn with 4 SNR 



(0.5, 1, 2 and 4) and 2 numbers of underlying 
dimensions (10, 100). 

Results for the SURE method (Table ^ 
are in agreement with the results obtained by 
Candes et al ( 2012[ ). As in the first simulation 
study (section 3.1), rPCA outperforms both 
PCA and the SURE method in almost all cases. 
However, the SURE method provides better 
results than rPCA when the number of un- 
derlying dimensions S is high (S — 100) and 
the SNR is small (SNR = 1, 0.5). This is in 
agreement with the previous comments high- 
lighting the ability of the SURE method to 
handle noisy situations. Nevertheless, we can 
note that when the SNR is equal to 0.5, rPCA 
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is performed with the "true" number of under- 
lying dimensions (100) whereas estimating the 
number of underlying dimensions would lead, 
on this data, to conserve dimensions, accord- 
ing to several methods. Data are very noisy 
and the signal can be considered as nearly lost. 
Results obtained with rPCA, taking into ac- 
count dimension, boils down to predict all 
the values by which corresponds to an MSE 
equal to 1. In this case, considering dimen- 
sions in rPCA leads to a lower MSE than tak- 
ing into account 100 dimensions (MSE = 1.48), 
but it is still higher than the MSE of the SURE 
method (0.85). 



The R (R Core Team, 20121 code to per- 



form all the simulations is available on the web 
page of the authors. 



3.3 Recovery of the graphical outputs 

Our objective is also to obtain graphical out- 
puts (individual and variable representations) 
which are as close as possible to the outputs 
obtained from X. Let us consider a toy data 
set with 100 individuals, 20 variables, 2 un- 
derlying dimensions, (di/c^) = 4 and a SNR 
equal to 0.8 (row 5 of Table [TJ. 

Figure [2] provides the true individual rep- 
resentation obtained from X (top left) as well 
as the representations obtained by PCA (top 
right), rPCA (bottom left) and the SURE 
method (bottom right). The cloud associated 
with PCA has a higher variability than the 
cloud associated with rPCA which is tightened 
around the origin. The effect of regularisation 
is stronger on the second axis than on the first 
one, which is expected because of the regu- 
larisation term. For instance, the individuals 
82 and 59 which have small coordinates on 
the second axis in PCA are brought closer to 
the origin in the representation obtained by 
rPCA which is more in agreement with the 
true configuration. The cloud associated with 
the SURE method is tightened around the ori- 
gin on the first axis and even more on the sec- 
ond one, which is also expected because of the 



regularisation term. However the global vari- 
ance of the SURE representation, which is re- 
flected by the variability, is clearly lower than 
the variance of the true signal. Therefore, the 
global shape of the cloud of rPCA is the closest 
to the true one and thus rPCA recovers well 
the distances between individuals. 

Figure [3] provides the corresponding repre- 
sentations for the variables. The link between 
the variables which have high coordinates on 
the first and the second axis of the PCA of X 
is reinforced in rPCA. This is consistent with 
the representation of X. For instance, vari- 
ables 9 and 7 which are correlated to 1 in X 
are not very linked in the PCA representation 
(correlation equal to 0.68) whereas their cor- 
relation equals 0.81 in the rPCA representa- 
tion and 0.82 in the SURE representation. On 
the contrary, variables 20 and 7, orthogonal 
in X, have rather high coordinates, in abso- 
lute value, on the second axis in the PCA rep- 
resentation (correlation equal to -0.60). Their 
link is slightly weakened in the rPCA repre- 
sentation (correlation equal to -0.53) and in 
the SURE representation (correlation equal to 
-0.51). In addition, all the variables are gener- 
ated with a variance equal to 1. The variances 
are over-estimated in the PCA representation 
and under-estimated in the SURE represen- 
tation, particularly for the variables which are 
highly linked to the second axis. The best com- 
promise for the variances is provided by rPCA. 
Therefore, rPCA recovers well the variances 
and the covariances of the variables. 



4 Application to real data sets: 
transcriptome profiling 

Regularised PCA is applied to a real data set 



( Desert et al 2008 ) which consists in a collec- 



tion of 12664 gene expressions in 27 chickens 
submitted to 4 nutritional statuses: continu- 
ously fed (N), fasting for 16 hours (F16), fast- 
ing for 16 hours then refed for 5 hours (F16R5), 
fasting for 16 hours then refed for 16 hours 
(F16R16). 
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Table 2: Mean Squared Error (and its standard deviation) between X and X for PCA, regularised 
PCA (rPCA) and SURE method over 100 simulations. Results are given for n — 200 individuals, 
p = 500 variables, different numbers of underlying dimensions (S) and signal-to-noise ratios 
(SNR). 



S SNR 


MSB(X PCA ,X) 


MS£(X rPCA ,X) 


MSS(X SURE ,X) 


10 4 
10 2 
10 1 
10 0.5 
100 4 
100 2 
100 1 
100 0.5 


4.31E-03 (7.96E-07) 
1.74E-02 (2.84E-06) 
7.16E-02 (1.25E-05) 
3.19E-01 (5.44E-05) 
3.79E-02 (2.02E-06) 
1.58E-01 (8.99E-06) 
7.29E-01 (4.84E-05) 
3.16E+00 (1.65E-04) 


4.29E-03 (7.91E-07) 
1.71E-02 (2.81E-06) 
6.75E-02 (1.15E-05) 
2.57E-01 (4.85E-05) 
3.69E-02 (1.93E-06) 
1.41E-01 (7.98E-06) 
4.91E-01 (2.96E-05) 
1.48E+00 (1.12E-04) 


8.74E-03 (1.15E-06) 
3.29E-02 (4.68E-06) 
1.16E-01 (1.59E-05) 
3.53E-01 (5.42E-05) 
4.50E-02 (2.12E-06) 
1.56E-01 (8.15E-06) 
4.48E-01 (2.26E-05) 
8.52E-01 (3.07E-05) 



Since there are 4 nutritional statuses, 3 di- 
mensions are considered. We expect the first 
three principal components to represent the 
between class variability, whereas the following 
components represent the within class variabil- 
ity which is less of interest. Figure [4] shows the 
individual representations obtained by PCA 
(top left), rPCA (top right) and the SURE 
method (bottom left). To better highlight the 
effect of regularisation, the dimensions 1 and 3 
are represented. The first dimension of PCA, 
rPCA and the SURE method order the nutri- 
tional statuses from the continuously fed chick- 
ens (left) to the fasting chickens (right). The 
chicken N.4 and the chicken F16R5.1 which 
have high coordinates, in absolute value, on 
the third axis of PCA are brought closer to the 
other chickens submitted to the same status 
in the rPCA representation and in the SURE 
representation. In addition, the chickens N.l 
and F16.4 which have high coordinates on the 
first axis are brought closer to the origin in 
the SURE representation. Despite these differ- 
ences, the impact of the regularisation on the 
graphical outputs appears to be small. 

The representation obtained after a sparse 



PCA (sPCA) method flWitten et al||2009|) im 



plemented in the R package PMA (Witten et al 



20111 is also provided (bottom right). Indeed, 



it is very common to use sparse methods on 



this kind of data (Zou et al 2006). The basic 



that PCA provides principal components that 
are linear combinations of the original vari- 
ables which may lead to difficulties during the 
interpretation especially when the number of 
variables is huge. Loadings obtained via sPCA 
are indeed sparse which means they contain 
many elements which comes down to select 
variables. The representation stemming from 
sPCA is quite different from the other repre- 
sentations, particularly the clusters of F16R5 
and of F16 chickens are less clearly differenti- 
ated. 

It is usual to complement principal com- 



ponents methods with heatmaps (Eisen et al 



1998) in order to simultaneously cluster the 



chickens and the genes. Because rPCA modi- 
fies the distances between chickens as well as 
the covariances between genes, the rPCA 
heatmap will differ from the PCA heatmap. 
The heatmap clustering is applied to the ma- 
trices X obtained by the different methods (Fig- 
ure^. The rPCA heatmap (Figure 5b) is much 
more satisfying than the PCA heatmap (Fig- 
ure 5a). Indeed, the chickens submitted to 16 



assumptions for the development of sPCA is 



hours of fasting are distinguished into two sub- 
clusters in the PCA heatmap separated by the 
chickens F16R5.1, F16R16.3 and F16R16.4, 
whereas they are well-clustered in the rPCA 
heatmap. Similarly the chickens F16R5 are 
gathered in the PCA heatmap except for chick- 
ens F16R5.1 and F16R5.3, whereas they are 
well-clustered in the rPCA heatmap. Finally, 
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Fig. 2: Individual representations of X (top left), of the PCA of X (top right), of the rPCA of X 
(bottom left) and of the SURE method applied to X (bottom right) for a data set with n = 100, 
p = 20, S = 2, (d 1 /d 2 ) = 4 and SNR= 0.8. 



the F16R16 chickens are more scattered in both 
representations. However in rPCA, it can be 
interpreted as some of the chickens having fully 
recovered from the fasting period are mixed 
with continuously fed chickens and some hav- 
ing not fully recovered are mixed with F16R5 
chickens: the large majority of F16R16 chick- 
ens are gathered and mixed with N.6 and N.7, 
and chicken F16R16.1 is mixed with F16R5 



chickens. It is not the case for PCA, where 
the F16R16 chickens are mixed with chickens 
submitted to all the other nutritional statuses. 
The conclusions concerning the SURE heatmap 



(Figure 5c ) are roughly similar to the conclu- 
sions drawn for rPCA. Globally, the 4 clusters 
corresponding to the 4 nutritional statuses are 
well-defined. However, the chicken F16R5.3 is 
clustered with the N chickens. In addition, the 
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Xtilde PCA 




Fig. 3: Variable representations of the PCA of X (top left), the PCA of X (top right), the rPCA 
of X (bottom left) and the SURE method applied to X (bottom right) for an example of data 
set with n = 100, p = 20, S = 2, (d 1 /d 2 ) = 4 and SNR= 0.8. 



global contrasts are weaker in the SURE 
heatmap than in the rPCA heatmap. The 
heatmap stemming from sPCA (Figure 5d| 
seems to be easier to interpret (more contrasts) , 
due to the drastic selection of genes (43 on the 
12664). However none of the chicken clusters 
is clearly defined. 

We will not dwell on the interpretation of 
the gene expressions in the heatmap, however, 



if the chicken clustering is coherent, the gene 
clustering is expected to be more coherent as 
well. 



Conclusion 

When data can be seen as a true signal marred 
by errors, PCA does not provide the best re- 
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Fig. 4: Representation of the individuals on the dimension 1 and 3 of the PCA (top left), the 
rPCA (top right), the SURE method (bottom left) and sPCA (bottom right) of the transcriptome 
profiling data. Individuals are coloured according to the nutritional statuses. 



covery of the underlying signal. Thresholding 
the singular values improves the estimation of 
the underlying structure especially when data 
are noisy. Soft thresholding is one of the most 
popular strategies and consists in linearly 
shrinking the singular values. In this frame- 
work, 



Candes et al (20121 developed an automatic 



strategy to determine the regularisation pa- 
rameter. The regularised PCA proposed in this 
paper applies a nonlinear transformation of the 
singular values and a hard thresholding rule. 



The regularised term is analytically derived 
from the MSE using asymptotic results from 
nonlinear regression models. In the simulations, 
rPCA outperforms the SURE method in most 
of the situations. 

However, rPCA requires a tuning parame- 
ter which is the number of underlying dimen- 
sions. Many methods are available in the liter- 
ature to select this parameter. However, it is 
still a difficult problem and an active research 
area. A classical statement is the following one: 
if the selected number of dimensions is smaller 
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(a) PCA 



(b) rPCA 




(c) SURE 




(d) sPCA 



Fig. 5: Heatmaps associated with the analysis of the transcriptomic data set. The data sets used 
to perform the heatmaps are the fitted matrices stemming from PCA (a), rPCA (b), the SURE 
method (c) and sPCA (d). 



than the rank S of the signal, a part of the rel- 
evant information is lost and, in our situation, 
it results in overestimating the noise variance. 
On the contrary, selecting more than S di- 



taken into account even if the noise variance is 
underestimated. However, the noise may over- 
whelm the signal which lead to "loose" the di- 
mensions of the signal which have the small- 



mensions seems better because all the signal is est variability. In such a case, it is better to 



Regularised PCA to denoise and visualise data 



17 



select a number of dimensions smaller than 
S. This strategy is a way to regularise more 
which is acceptable when data are very noisy. 
In practice, we use a cross-validation strategy 



Josse and Husson ( 201 lh which has a good be- 



haviour in our simulations (find the true num- 
ber of dimensions when the signal-to-noise ra- 
tio is large, and find a smaller number when 
the signal-to- noise ratio is small). 

The graphical outputs provided by rPCA 
are closer to the ones that would be obtained 
from the signal only since rPCA improves the 
recovery of the signal compared to classical 
PCA. However, the graphical representations 
are biased. Another solution could be to keep 
the graphical representations obtained from 
PCA and to enrich them through confidence 
areas. Such areas may be constructed using a 
residual bootstrap procedure. 
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